R Update

We will mainly use two packages during this workshop: DADA2 and Phyloseq. It is recommanded that you update your R software in order that these packages run smoothly. Indeed, in order to install DADA2, you must have R version 3.5.0 or newer, and Bioconductor version 3.7. Using Bioconductor is the easiest way to install both DADA2 and Phyloseq. You can check your actual R version by executing the version command in R.

version

If your version is older than 3.5, follow the next steps. If not, skip to the Installing the packages section.


For Windows users

install.packages("installr")
library(installr)
updateR()

What is nice with this package is that it should migrate the packages you’ve installed on your older version of R to the folder corresponding to the newest version of R. It also updates all the packages you’ve installed from the CRAN repository.


For Mac users

install.packages('devtools')
library(devtools)
install_github('andreacirilloac/updateR')
library(updateR)
updateR(admin_password = 'Admin user password')

In order to migrate your packages from your old version to the new one: Follow this path: /Library/Frameworks/R.framework/Versions/ or /Bibliothèque/Frameworks/R.framework/Versions/ for the french speakers. Change the folder’s name to the 3.5.1 (latest R version).

You then want to update your packages using this command.

update.packages(checkBuilt=TRUE, ask=FALSE)

This will only update the packages installed from the CRAN repository.

In order to check if everything went according to plan:

version
packageStatus()

For Unix users

:)



Reference 1
Reference 2


Installing the packages

Now that you’ve got a brand new R, we’re going to install DADA2 and Phyloseq. For that, we must first install Bioconductor as we’re going to use one of its functions biocLite.

source("https://bioconductor.org/biocLite.R") # Installs Bionconductor
biocLite("dada2") # Installs DADA2 
biocLite("phyloseq") # Installs phyloseq 

If this doesn’t work you can try other methods. Methods for DADA2 and Phyloseq.
If you don’t succeed, we’ll help you out on the first morning of the workshop.


Vignettes

Some packages include vignettes which are documents explaining the pipeline. Use the vignette() function

library(dada2) # Loads the DADA2 package in the environment
vignette("dada2")

Getting specific information on a package or function

On R Studio you can type ?‘name’ or help(‘name’) For example:

library(dada2) # Loads the DADA2 package in the environment
?dada2 
?filterAndTrim # One of DADA2's functions
help(filterAndTrim)

If you’re just using R, use ??‘name’. It will open a page in your internet browser with the documentation on the function you want documentation on.

library(dada2) # Loads the DADA2 package in the environment
??dada2
??filterAndTrim # One of DADA2's functions
This is particularly interesting to understand the parameters the function uses


You can also see the source code of a function by just typing its name.

library(dada2) # Loads the DADA2 package in the environment
filterAndTrim


Reference 3


Setting and checking your directory

Whenever you’re launching R, it selects a default directory (i.e a file in which you can save your script, your objects). The function getwd() will give you the path to the directory you are in. In order to change it, the function setwd() is used. You have to specify the path to the directory you want. Another usefull function is list.files() which will return all the files present in the directory you’re in.

getwd()
setwd("/Users/Jack/Documents/BIOME/")
list.files()

Saving and loading your objects

When you run a pipeline or analysis, it is convenient to save the objects such as matrices or dataframes. By doing so, you won’t have to run your analysis every single time you want to view the object in question. In order to save an object the function saveRDS() can be used. The object can then be opened with the readRDS() function. This pair of functions are an alternative to the save() and load(). Their adventage is to allow the user to give a new name to the saved object when they load it.

mat <- matrix(sample(0:1, 12, replace=TRUE),3,4) # A 3 by 4 matrix containing 0s and 1s
saveRDS(mat, "Neo") # Save mat with Neo as a file name 
the.matrix <- readRDS("Neo") # Load the file Neo but it will no longer be named mat but the.matrix in your environment
identical(mat, the.matrix ) # Checks if mat and my.matrix are identical


However if you want to save multiple objects at a time you’ll have to use save()

Reference 4


The S4 object system

Phyloseq uses the S4 object system, which is one of the oriented object systems used in R. This system requires a bit of explanation. An S4 object (S4O) can be seen as multiple objects regrouped into one entity. Each object within the S4O has a class by three properties: a name, a representation giving their names and classes for example a car class would have representation(brand=“character”, horsepower=“numeric”) a charactor vector of classes that this class contains The class is created with the function setClass()

setClass("Person", representation(name = "character", age = "numeric"))
setClass("Employee", representation(boss = "Person"), contains = "Person")

a new object within the class is created with new()

Neo <- new("Person", name="Neo", age = 31)

To access slots of an S4 object, you’ll need to use @ and not $. The getSlots() returns a descritpion of the slots of a class.

Neo@age
getSlots("Person")

Reference 5


Subsampling your data

The more data you have, the slower your analyses are going to run. Therefore, it can be wise to subsample your data for this workshop. To do so, you can use the following script on the terminal. It first downloads the seqtk program. Place the program in the file where your fastq files are stored. Then you’ll use a loop in order to subsample reads within each sample. For the script to work your samples have to be named as such: SampleX_R1.fastq and SampleX_R2.fastq, X being a number identifying your samples. You will have to modify either the script or the names of your samples if they follow a different annotation. The subsampled reads will be placed in files named subsampleX_R1.fastq and subsampleX_R2.fastq (X being a number identifying your samples).

git clone https://github.com/lh3/seqtk.git; # Downloads the seqtk program
cd seqtk; make

for i in {1..10} ; # For sample 1 to sample 10 (you can adjust this to the number of samples you have)
do 
./seqtk sample -s100 Sample${i}_R1.fastq 500 > subsample${i}_R1.fastq ; 
#Takes 500 random reads in Sample${i}_R1.fastq and places them in the subsample${i}_R1.fastq file. Likewise you can adjust 500 to the number of reads you want to keep per sample. 
./seqtk sample -s100 Sample${i}_R2.fastq 500 > subsample${i}_R2.fastq ; 
# Same thing as above but for the reverse / R2 reads. It will take the reverse reads corresponding to the forward reads that have been selected above.
done

Mike Love’s Bioconductor “Refcard”

Refcard